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Abstract 

Background: Based on available biological information, genomic data can often be partitioned into pre-defined sets 
(e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets 
of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are 
available in the literature for making such determinations, however, they do not take into account information 
regarding the subsets within each set. Secondly, variables (e.g. genes) belonging to a set or a subset are potentially 
correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power 
and/or inflated false positive rate. 

Results: We introduce a multiple testing-based methodology which makes use of available information regarding 
biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence 
structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables 
are differentially expressed between two experimental conditions, but may also test whether specific subsets within a 
significant set are also significant. 

Conclusions: The proposed methodology; (a) is easy to implement, (b) does not require inverting potentially 
singular covariance matrices, and (c) controls the family wise error rate (FWER) at the desired nominal level, (d) is 
robust to the underlying distribution and covariance structures. Although for simplicity of exposition, the 
methodology is described for microarray gene expression data, it is also applicable to any high dimensional data, such 
as the mRNA seq data, CpG methylation data etc. 



computational methods have therefore been developed 
for such analyses. Although the methods described in this 
paper are broadly applicable to any high dimensional data 
where the sets and subsets are pre-defined, for simplicity 
of exposition, we shall describe the methodology in the 
context of gene expression data. The available gene set 
analysis (GSA) methods can be broadly classified into two 
categories. Loosely speaking, the first category of meth- 
ods, often referred to as competitive gene set methods, 
tries to answer questions such as "Given the collection 
of differentially expressed genes identified by a statisti- 
cal/bioinformatics methodology, how enriched is a pre- 
specified set?" For example, suppose Si and 52 are two 
pre-specified sets consisting of Ni and A/2 genes respec- 
tively. Suppose an investigator identified a total of n genes, 
«( of which belong to set Si, i = 1, 2. Then this category of 
methods computes the probability of discovering «i(«2) 



Background 

With the advent of high dimensional genomic data, 
researchers are able to study changes in the expression 
of several hundreds and thousands of variables such as 
genes or CpG's under various experimental conditions (or 
phenotypes) in a given cell culture, tissue or an organ- 
ism etc. Although identification of diff^erentially expressed 
individual variables across experimental conditions is of 
general interest, in recent years there is considerable inter- 
est in analyzing sets of variables that belong to some 
pre-specified biological categories such as signaling path- 
ways and biological functions. Numerous statistical and 
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or more genes from the set 51(52). Several variations and 
innovations to Fisher's exact test, Kolmogorov-Smirnov 
test, etc, have been proposed in the literature for obtaining 
the corresponding /7-values (c.f. [1-5]). Software packages 
such as Ingenuity Pathway Analysis (IPA) report values 
using such tests. The second category of methods answers 
a different but equally important question (c.f. [4,6-11]), 
namely, "Is a given set of genes differentially expressed 
between two conditions?" In this category of methods the 
gene set information is directly used when selecting differ- 
entially expressed sets of genes between two experimental 
conditions and the question it answers has a clear bio- 
logical meaning. Commonly this category of methods is 
referred to as self-contained methods, which is the focus 
of this paper. 

Most earlier methods (belonging to either of the two 
categories described above) are based on univariate statis- 
tical tests and thus ignore the underlying dependence in 
the gene expression data (c.f [1-5,9,12-14]). For a review 
one may refer to [11,15,16]. It is well known that univariate 
statistical methods for multivariate data may potentially 
increase false positive rate and/or decrease power [17]. 

A natural multivariate extension of the classical t-test 
is the Hotelling's test which can be used for compar- 
ing a set of genes between two experimental conditions. 
Consequently, several GSA methods using Hotelling's 
test have been proposed in the literature such as [18- 
20]. Intrinsically, the Hotelling's statistic requires the 
sample size to be larger than the number of variables. 
However, for GSA, it is common for the sample size 
to be much smaller than the number of genes in a set. 
As a consequence, the Hotelling's statistic is not 
uniquely defined. To deal with the singularity problem, 
several approaches have been proposed in the literature. 
For instance, Kong et al. [8] modified the Hotelling's 
statistic by replacing the inverse of sample covariance 
matrix by its Moore-Penrose inverse based on the first few 
eigenvalues. Although this procedure is appealing, there 
is arbitrariness in the choice of number of eigenvalues 
to be used. Recendy [11] introduced a shrinkage based 
Hotelling's statistic by replacing the sample covariance 
matrix by a shrinkage estimator of the covariance matrix 
derived in [21]. Although such modifications are compu- 
tationally more stable than the Hotelling's T^, for large 
gene sets (i.e. sets with a large number of genes), they 
still pose computational challenges. It is because that the 
test statistic involves the inversion of a high dimensional 
covariance matrix even though it may be non-singular. 
Lastly, all multivariate methodologies described above 
implicitly assume that the gene expression data in the 
two experimental conditions are homoscedastic across all 
genes. That is, for a given set of genes the covariance 
matrix of gene expression in the two groups is identi- 
cal. This, in our opinion, is a very restrictive assumption 



and may be hard to verify in practice when dealing with 
microarray data consisting of several thousands of genes. 

To gain deeper understanding of the differences 
between the two experimental/test groups (e.g. cancer 
and normal patients), there is considerable interest in 
identifying not only sets of genes involved in a pathway 
or a biological process, but also in identifying subsets of 
genes belonging to a particular biological process within 
each significant set. For example, genes in the Vascular 
Endothelial Growth Factor (VEGF) pathway are impor- 
tant for angiogenesis. There are about 31 genes in this 
pathway that are involved in various biological processes. 
These 31 genes can be further partitioned into different 
subsets of biological functions and the biologist may be 
interested in discovering not only the VEGF pathway 
but also various subsets of genes within this pathway. 
For example, MAP2K3, MAP2K6, p38, MAPKAPK2, 
MAPKAPK3, and HSP27 are involved in Actin reorgani- 
zation, FAK and Paxillin are involved in Focal Adhesion 
Turnover, whereas GRB2, SHC, SOS, Ras, Rafl, MEKl, 
MEK2, ERKl, and ERK2 are involved in gene expression 
and cell proliferation. Similarly, other genes in VEGF 
pathway are involved in various other biological pro- 
cesses, such as cell survival, vascular cell permeability, 
prostaglandin production, and nitric oxide production. 

In examples such as the above, we may (i) be inter- 
ested in using the additional information about the sub- 
sets to improve the power of detecting gene sets (such 
as the VEGF pathway), and (ii) not only be interested in 
knowing if genes in the VEGF pathway are differentially 
expressed between control and treatment group, but also 
interested in identifying subset of genes in biological pro- 
cesses within VEGF pathway that are also differentially 
expressed between the two groups. Methods described 
above and other multivariate statistical methods, such as 
the methods based on principal component analysis [7], 
the mixed effects logistic regression [6], analysis of covari- 
ance [10,22], are not designed to address such questions 
directly. If one ignores information regarding subsets, 
then there is not only a loss of biological information when 
interpreting the data, but also a potential loss in power. On 
the other hand, one may use the existing methods by tak- 
ing the subsets as the unit of analysis rather than the sets. 
However, such a strategy destroys the underlying relation- 
ships among subsets within a set and consequently may 
result in loss of power. 

In this paper we introduce a novel methodology that 
(a) is computationally simple and does not require inver- 
sion of any matrix, (b) exploits the underlying dependence 
structure, (c) is useful for identifying significant gene sets 
and subsets within each significant set, (d) controls the 
overall familywise error rate (EWER) at the desired nom- 
inal level, and (e) is robust to potential heteroscedasticity 
in the data. 
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The basic idea of the proposed method is rather sim- 
ple. Using the available biological knowledge, we partition 
the sets of genes into various subsets within sets. Within 
each gene subset so obtained, we perform a variation 
of Hotelling's test and calculate the corresponding 
p-value using bootstrap methodology. We then perform 
multiple testing corrections using Bonferroni method for 
controlling the FWER. To control the FDR, the proposed 
methodology can be easily modified by using Benjamini- 
Hochberg (BH) procedure [23] to replace the Bonfer- 
roni method. Using extensive simulations, we studied 
the performance of the proposed procedure in terms of 
power and the FWER control. We illustrate the proposed 
methodology using a recently published data of [24]. 

Methods 

Notations 

Suppose we are interested in comparing two experimental 
conditions on the basis of mean expression levels of genes 
belonging to K pre-specified sets of genes Si, S2, . ■ ■ Sk- 
For instance, these gene sets may represent different path- 
ways or biological functions, derived from databases such 
as GO, KEGG, IPA, etc. Furthermore, suppose each gene 
set Si;, k = 1,2, .. .,K, is & union of mi; pre-specified sub- 
sets Si; J, • ■ ■ ,Si;^rni; such that Sk = U/l*i'^*^.!- Note that 
Sk,i n ^k,j is not necessarily an empty set for any i ^ /. 
Suppose there are a total of G genes on the microarray 
and suppose Xy is a G x 1 random vector correspond- 
ing to the sample, / = 1, 2, ... , m, in the group, 
/ = 1, 2 with mean vector E(Xij) = fi, and covariance 
matrix Cov(X,y) = X/, where = (fj^n, . . ., tiic)' , i = 1,2. 

For set S^, we are interested in testing the following 
null and alternative hypotheses; Hk : /i^^ = /A2,A: versus 
H'k '■ I'-hk ^ l^2,k> where /t,- = (/U.,,/ : / e S^) denotes the 
mean vector of genes in the set Sk for samples from the jth 
group, i = 1, 2. Similarly, for genes belonging to the subset 
^k,/ C the hypotheses of interest are, : fii_kj = fi2,kj 
versus H'^ . : fi^j;. # fiy;., where fi^. = (fiij : / € Sk,j) 
denotes the mean vector of genes in the subset Sk,j for 
samples from the ith group, / = 1, 2. 

The test statistic and its null distribution 

We shall now describe the test statistic using a generic 
notation. Suppose, for i = 1, 2, X,i, X,-2, • • • , X(„; is a ran- 
dom sample of G x 1 vectors from a common population 
with mean vector fii and covariance matrix X,-. Let X; 
denote the sample mean vector corresponding to the i''^ 
population, i = 1, 2, and let S denote the usual pooled 
sample covariance matrix. Samples randomly drawn from 
these two populations are independent. Then under the 
assumption of Zi = T.2, the Hotelling's statistic is pro- 
portional to (Xi — X2)'S~^(Xi — X2). For large values of 
G, statistics such as the Hotelling's T-^ and Fisher's linear 



discriminant function can be unstable since they involve 
the inversion of a high dimensional covariance matrix S. 
In the context of discriminant analysis [25], it was sur- 
prisingly found that the linear discriminant function that 
ignored the off-diagonal elements of S performed better 
than Fisher's linear discriminant function that used the 
entire matrix S. In addition, in practice it may not be 
suitable to assume that Zi = S2. Motivated by these rea- 
sons, we use the following test statistic for testing the null 
hypotheses described in the above subsection: 

(1) 

where Diag(Si) is a diagonal matrix containing the 
diagonal elements of the sample covariance matrix S/, 
/ = 1, 2. 

Since the underlying gene expression data are not nec- 
essarily multivariate normally distributed and the covari- 
ance matrices of these two groups are potentially unequal, 
the exact distribution of the above test statistic under the 
null hypothesis cannot be determined easily. We there- 
fore adopt bootstrap methodology for simulating the null 
distribution of the test statistic such that the result- 
ing methodology is not only robust to heteroscedasticity 
but also preserves the underlying dependence structure 
among genes. To do so, we draw simple random sam- 
ple (with replacement) of «; subjects from the group, 
/ = 1, 2 and construct the bootstrap data using the resid- 
uals 6*. = Xij — Xi, i = 1, 2, j = 1,2,..., rii from the 
resampled subject /. Thus the bootstrap data are given 
by X*. = :^ + i = 1, 2, / = 1,2,...,«,-, where 

X = "^^jlj]^^^^ , the weighted average of the two sam- 
ple means, and 6* is the residual corresponding to the 

subject selected. For more details regarding the resid- 
ual bootstrap methodology we refer the reader to [26,27]. 
It is important to recognize that the residual bootstrap 
methodology implemented here is different from the usual 
bootstrap methodology. The standard bootstrap may not 
honor the structure present in the data and hence may 
potentially result in an inflated false positive rate. We 
remark that our proposed test statistic resembles the 
test statistic of [9], in the sense that neither procedure 
uses the off diagonal elements of the estimated covari- 
ance matrices. The two procedures, however, differ in the 
denominators used in the test statistic. The proposed test 
allows for unequal variances in the two populations that 
are being compared. Secondly, and more importantly, the 
two procedures fundamentally differ in the resampling 
schemes used. As noted above, the proposed method- 
ology bootstraps the residuals and thus allows for any 
underlying dependence structure in the data (unknown 
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to the user) whereas the resampUng scheme used in [9] 
intrinsically assumes that under the null hypothesis the 
two populations under comparison are identically dis- 
tributed, which is often not the case in practice. This 
is a major and an important difference between the 
two methods. 

The proposed strategy 

For each S/^j c S^, let the test statistic (1) be denoted 

by 

"^dia^.k.j corresponding bootstrap j?-value 

be denoted by Pi^j. If we have only a single gene-set 
with gene-subsets Sk,i, ■ ■ ■ tS/^^mi^, then within the 
problem of testing the significance of Sk_i, ■ ■ ■ , Sk^nn^ is for- 
mulated as a problem of simultaneously testing a family 
of rrn: null hypotheses, = {Hk,i, ■ ■ ■ ,H]^,mk)> using the 
available jj-values Pkj'^- The gene-set Sk is declared to be 
significant if and only if at least one H^^j is rejected in the 
above problem of multiple testing. 

There are two popular notions of type I error rates 
when dealing with the problem of simultaneously testing 
multiple hypotheses, one is to control FWER, which is 
the probability of falsely rejecting at least one true null 
hypothesis, and the other is to control the FDR, which 
is the expected ratio of false rejections to the total num- 
ber of rejections [23]. In this article we shall only describe 
methods to control the FWER. 

There are several FWER controlling procedures avail- 
able in the literature for testing the family of null hypothe- 
ses, J-yt = {Hk^i,-- - iHk^m^]. In this paper we con- 
sider the following Bonferroni based procedure: For a 
given set of null hypotheses Fk, we reject Hi^j e Fk 
if Pk,j < ajnik- The corresponding Bonferroni-adjusted 
/;-value for the set of null hypotheses Ty^ \& T^^. = 
mm{mi(Pl(j,j = 1, . . . , mk). Similarly, if we have multiple 
gene-sets Sk,k = 1 . . .,K, each of which having Wjt gene- 
subsets Sk^i, - ■ ■ ,5^,m/t> then the problem of testing the 
significance of all gene-subsets in the K gene sets is for- 
mulated as a problem of simultaneously testing K families 
of null hypotheses, = {W^^.i, • • • >Hk,mi^}>k = 1,...,K, 
using the available /^-values P/^j, . . . ,Pk,mitt k = 1, . . . ,K, 
in which for each gene-set S]^,k = 1, . . . , K, it is declared 
to be significant if and only if at least one hypothesis Hj^j 
in Fk is rejected. 

For testing the K families Fk, k = 1, . . . , /C, a simple 
Bonferroni based strategy is proposed as follows. 

The Procedure 

Step 1. Compute raw residual bootstrap p -value for 

each subset of genes. 
Step 2. Compute adjusted p -values for each set 

(adjusting for the number of subsets within the 

set) as described above. 
Step 3. Declare a set 5^^ to be significant if its adjusted 

p-value is less than a/K. A subset 5;^,; within the 



set Sk is declared to be significant if its raw 
p-value Pkj is less than a/Kmk. 

It is easy to see that the above proposed procedure 
strongly controls the overall FWER for any dependent test 
statistics, the probability of falsely rejecting at least one 
true null hypothesis in some family. 

When the number of gene sets and gene subsets is large, 
it might be preferable to control the FDR rather than the 
FWER. The above proposed testing strategy controlling 
the FWER can be easily modified to control the FDR by 
using the BH procedure to replace the Bonferroni proce- 
dure when simultaneously testing the significance of the 
gene sets. Such modified strategy is very similar to a two- 
stage test strategy developed in [28] for controlling the 
overall FDR while selecting significant gene sets and their 
significant individual genes. 

Simulation study 

We evaluate the performance of the proposed method- 
ology in terms of power (the probability of rejecting at 
least one false null hypothesis) and the FWER control with 
Tsai and Chen's method in [11], which uses the shrink- 
age estimator of the sample covariance matrix proposed 
in [21]. Note also that, unlike the bootstrap residuals used 
in the proposed methodology for deriving the null distri- 
bution of the test statistic, the resampling scheme used in 
[11] resembles the scheme used in [9]. Such resampling 
schemes do not honor the differences (if any) in depen- 
dence structure of the two populations that are being 
compared. Thus, if the two populations have different 
covariance structures under the null hypothesis, then as 
stated earlier in this paper, the standard permutation or 
standard bootstrap methodology can potentially result in 
an inflated FWER. 

Study design 

In the simulation study, we considered two patterns of 
total number of sets of genes, which were 5 and 10. 
Since, in practice, the number of subsets and the num- 
ber of genes within a subset may be unknown a priori, 
we allowed the number of subsets within each set of 
genes to be uniformly distributed in the range 5 to 16 and 
the number of genes within each subset was generated 
according to a uniform distribution in the range 5 to 10. 
To understand the robustness of the two methods in terms 
of FWER control, we considered a variety of probability 
distributions for the gene expression as follows: 

(1) Multivariate normal distribution, of appropriate 
dimension, with mean vectors 0 (for the control 
group) and //. (for the treatment group), and 
covariance matrices Xi (for the control group) and 
1^2 (for the treatment group), respectively. As 
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commonly done, we assumed intra-class correlation 
structure for the two covariance matrices, with 
variances a^, cri and correlation coefficients pi and 
p2, respectively. We considered two cases, namely, 
fi = o'2> Pi = P2 (homoscedastic or homo.) and 
fi / a'2, pi 7^ P2 (heteroscedastic or hetero.). In 
practice, one never knows a priori whether we have 
homoscedasticity or heteroscedasticity. Since genes 
within each subset (whether control or treated 
groups) may have different variances, for each gene 
we let (Ti and (72 both take one of the five values, 0.1, 
0.5, 1, 1.25 or 1.5, at random. Similarly, correlation 
coefficient between a pair of genes may not 
necessarily be constant across all subsets (whether 
control or treated groups), for each subset we let pi 
and p2 both take one of the five values, 0, 0.25, 0.5, 
0.75 or 0.9, at random. Thus the variance and 
correlation coefficients vary randomly from subset to 
subset. For each subset of genes, we always let mean 
vector = 0 for the control group, /i = 51 for the 
treatment group, where S was taken to be 0.5, 1 or 1.5 
and 1 = (1, 1, • • • , 1)'. For each group, we considered 
two patterns of sample sizes, namely, 10 and 40. 

(2) Multivariate log normal distribution, where the 
vector of natural logarithm of each component 
follows multivariate normal distribution, with 
parameters as defined in the above setting of 
multivariate normal distribution, with Ei = E2. 

(3) Multivariate beta distribution. This distribution is 
motivated by CpG methylation data. Within each 
treatment group the multivariate beta vector was 
generated as follows. To generate p dimensional beta 
variable, we randomly generated p independent 
chi-square random variables Ui, U2, ■ ■ - Mp with 
either 4 or 5 degrees of freedom and generated an 
additional independent chi-square random variable 
V with either 1 or 2 degrees of freedom. The 
resulting multivariate beta type random vector for a 
given treatment group is defined as 

Z = (Zi, Z2, . . ..ZpY, where Z; = Lf;/(t// + V), 
i= l,2,...,p. With the above choices of degrees of 
freedom, the mean methylation values (commonly 
called the "beta" value) for our simulated CpG's 
ranged from approximately 0.67 to about 0.83. 

(4) Mixtures of multivariate normal random vectors. For 
each treatment group we generated mixture of 
multivariate normally distributed data Z as follows: 

Z ~ 7tN(0, Si) + (1 - 7t)N(l, E2). 

where n = 0.2. As in the case of multivariate 
normally distributed data in (1), we considered the 
homoscedastic as well as heteroscedastic covariance 
matrices for normal vector. The patterns of 
covariance matrices are as described in (1) above. 



All our simulation results are based on a total on 1,000 
simulation runs and 5,000 bootstrap samples. 

Results 

In Table 1 we summarize the simulated FWER of the 
proposed Bonferroni method and the TS method. In all 

patterns considered the FWER of the proposed test was 
closer to the nominal level except in one case where the 
estimated FWER exceeded the nominal of 0.05 by one 
standard error. This corresponded to the mixtures of mul- 
tivariate normal distributions case. On the other hand, as 
expected in the case of heteroscedastic data, the estimated 
FWER of the TS method often exceeded the nominal level 
of 0.05 by at least one standard error (which is approx- 
imately 0.007). Such cases are represented in bold face 
values. It is also interesting to note that the TS method 
was extremely conservative in the case when m=10 and the 
number of sets was 10. Although the shrinkage estima- 
tor of the covariance matrix is known to perform well for 
large p (the number of genes) and small n paradigm, in the 
present context as the number of sets of genes increases, 
the test statistic appears to be very conservative. This phe- 
nomenon is very striking when comparing the powers of 
the two tests (Table 2). The difference between the pro- 
posed method and TS is very noticeable especially when 
S is close to the value assumed in the null hypothesis and 
when n =10, and the number of subsets is also 10. As 
we depart away from the null hypothesis, the TS method 
catches up with proposed test and there is very little dif- 
ference between the two methods in terms of power for 
alternatives away from null hypothesis. 

We also compared the performance of the proposed 
procedure based on (1) with that based on the following 
Hotelling's type statistic which uses the entire sample 
covariance matrices Si and S2, 

r2 = (Xi-X2)'f- + -) \x1-X2). (2) 

V«l "2/ 

To ensure that the sample covariance matrices are non- 
singular, we chose the sample size in each group to exceed 
the total number of genes in each subset. In Table 3 
we provide a small representative sample of simulation 
results. As can be seen from Table 3, the proposed pro- 
cedure based on test statistic (1) has far greater power 
than the corresponding test based on (2) that uses the full 
sample covariance matrices. These findings, in the context 
of statistical testing, are consistent with [25] who discov- 
ered a similar phenomenon in the context of discriminant 
analysis for high dimensional data. 

Illustration 

Intramuscular injections among children often result in 
a variety of problems ranging from minor discomforts 
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Table 1 The simulated FWERs of the proposed method and the Tsai-Chen's method at level a = 0.05 



Distribution 


Variance 


Sample 


Number 


Proposed 


Tsai-Chen's 




structure 


size 


of sets 


method 


method 


Normal 


Homo. 


n=10 


5 


0.027 


0.019 








10 


0.032 


0.027 


Normal 


Homo. 


n=40 


5 


0.047 


0.054 








10 


0.047 


0.061 


Normal 


Hetero. 


n=10 


5 


0.036 


0.031 








10 


0.044 


0.021 


Normal 


Hetero. 


n=40 


5 


0.038 


0.066 








10 


0.052 


0.087 


Log-Normal 


Homo. 


n=10 


5 


0.027 


0.018 








10 


0.024 


0.022 


Log-Normal 


Homo. 


n=40 


5 


0.048 


0.039 








10 


0.050 


0.062 


Mix. Normal 


Homo. 


n=10 


5 


0.018 


0.009 








10 


0.020 


0.005 


Mix. Normal 


Homo. 


n=40 


5 


0.055 


0.050 








10 


0.050 


0.054 


Mix. Normal 


Hetero. 


n=10 


5 


0.018 


0.003 








10 


0.017 


0.003 


Mix. Normal 


Hetero. 


n=40 


5 


0.058 


0.060 








10 


0.049 


0.057 


Multi. Beta 


Var. func. mean 


n=10 


5 


0.033 


0.027 








10 


0.031 


0.031 


Multi. Beta 


Var. func. mean 


n=40 


5 


0.043 


0.042 








10 


0.053 


0.042 



such as, rash and pain, to more serious compUcations 
resulting in emergency room visits [24]. Ferre et al. [24] 
conducted a gene expression study on a sample of 10 
piglets to evaluate the effect of intramuscular injections 
on gene expression. Gene expressions were obtained at 
baseline, 6 hours, 2 days, 7 days and 21 days after injec- 
tion. For details of the study design one may refer to [24]. 
To illustrate the proposed Bonferroni-based methodology 
•we compared the mean expression of genes on day 7 with 
their mean expression at baseline. In all, there were 1,908 
probes on the cDNA chip. Since the data on one of the 
pigs was missing for day 7, we only used data from 9 pigs 
in our paired analysis, where the expression (1) is suit- 
ably modified to reflect paired data. Using IPA we mapped 
these 1,908 probes onto 1,195 genes describing 75 biolog- 
ical categories. In Additional file 1: Table SI (see online 
Supplementary Materials) we list all 75 biological cate- 
gories along with their sub-categories. Note that the gene 
names and biological categories obtained from IPA are 
only meant for illustrating our methodology. 



According to our Bonferroni-based methodology, 36 
out of 75 biological categories are significant at FWER 
level of 0.05 (see Additional file 2: Table S2 in the online 
Supplementary materials). In Additional file 3: Table 
S3 in the online Supplementary Materials, we list sub- 
categories within each category along with their Bonfer- 
roni adjusted /"-values. Results of the pathological exam- 
ination of the injured muscle on day 21 (post injury) 
conducted by [24] revealed formation of dense fibrous and 
collagenous tissue in the area of injection with regenera- 
tion and maturation of myocytes throughout the injected 
area. A scar with new myofibers and connective tissue 
were formed. Relative to baseline, their individual gene 
expression analysis of day 21 revealed significant differen- 
tial expression of genes such as coUagens, fibronectin and 
matrix metalloproteinase, etc. Interestingly, such genes 
are involved in biological categories such as Genetic 
disorder. Skeletal and muscular disorder. Protein Syn- 
thesis, Cell morphology. Connective tissue development, 
and Cellular development, which were all significant sets 
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Table 2 The simulated powers of the proposed method 
and the Tsai-Chen's method at level a = 0.05 for 
multivariate normally distributed data 
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according to our analysis (Additional file 2: Table S2). Our 
Bonferroni-based methodology allows a researcher to fur- 
ther probe the significance of each sub-category within 
the 36 significant categories. Results regarding the sig- 
nificance of each sub-category within each category are 
provided in the Additional file 3: Table S3 in the online 
Supplementary materials. 

Conclusions 

Since biologists are often interested in identifying a collec- 
tion of genes involved in a biological function or a pathway 
rather than individual genes, there has been considerable 
interest in recent years to develop statistical methods for 
identifying significant sets of genes. Usually, each path- 
way or biological function consists of a collection of (not 
necessarily disjoint) sub-pathways or sub-functions. Thus, 
each set of genes can be further partitioned into biologi- 
cally meaningful subsets of genes. In this paper we exploit 



Table 3 Power comparison of the suggested testing 
strategy based on test statistic (1 ) and (2) for 
homoscedastic case and the number of genes = 20 
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such structure information and propose a two-stage test 
strategy for selecting significant sets and subsets of genes 
between two experimental conditions while controlling 
the overall FWER. The proposed strategy is a general 
hierarchical test methodology, in which significant sets 
of genes are first identified by using Bonferroni proce- 
dure and then within each significant gene set, significant 
subsets of genes are further identified. 

Discussion 

Although we do not discuss the problem of selecting 
significant gene sets and subsets when comparing mul- 
tiple experimental conditions, the proposed methodol- 
ogy can be extended to such situations by replacing 
Hotelling's statistic by commonly used statistics such 
as the Hotelling-Lawley trace test or the Roy's largest 
root test. Furthermore, if the experimental conditions are 
ordered, such as in a time-course or a dose-response 
study, one can exploit order-restricted inference based 
methods developed in [29]. As commented by a reviewer 
of this manuscript, it is possible that in some applica- 
tions only a few genes in a given pathway are differentially 
expressed where such subsets are not necessarily pre- 
defined. We believe that discovery of such subsets could 
potentially generate interesting hypotheses for biologists 
to explore. The proposed methodology is targeted to 
identify pre-defined sets and subsets of genes that are dif- 
ferentially expressed, however, it would be interesting and 
useful to extend the proposed methodology to identify 
such unspecified subsets of genes. 

Additional files 



Additional file 1 : Table SI .Excel file containing gene sets, subsets and 
gene names. 

Additional file 2: Table S2. Excel file containing results of gene set 
analysis. 

Additional file 3: Table S3. Excel file containing results of gene subset 
analysis. 
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